Photizo: an open-source library for cross-sample analysis of FTIR spectroscopy data

Abstract
Motivation: With continually improved instrumentation, Fourier transform infrared (FTIR) microspectroscopy can now be used to capture thousands of high-resolution spectra for chemical characterization of a sample. The spatially resolved nature of this method lends itself well to histological profiling of complex biological specimens. However, current software can make joint analysis of multiple samples challenging and, for large datasets, computationally infeasible.
Results: To overcome these limitations, we have developed Photizo, an open-source Python library enabling high-throughput spectral data pre-processing, visualization and downstream analysis, including principal component analysis, clustering, macromolecular quantification and mapping. Photizo can be used for analysis of data without a spatial component, as well as spatially resolved data, obtained e.g. by scanning mode IR microspectroscopy and IR imaging by focal plane array detector.
Availability and implementation: The code underlying this article is available at https://github.com/DendrouLab/Photizo with access to example data available at https://zenodo.org/record/6417982#.Yk2O9TfMI6A.


Introduction
Fourier transform infrared microspectroscopy (mFTIR) enables nondestructive and label-free mapping of complex chemical information. The spatially resolved nature of this method lends itself well to the analysis of architecturally complex samples such as those of biological nature (Baker et al., 2014). The functional group specificity of mFTIR provides insight into biological queries, capturing relevant molecules such as lipids, proteins, nucleic acids and carbohydrates (Bellisola and Sorio, 2012).
Continually improving instrumentation and spectral analysis methods are increasing the applicability of vibrational spectroscopy for disease characterization and diagnosis. These methods have repeatedly been shown to partition data based on spectral features, distinguishing the biochemical profiles of healthy control samples from those of pathological specimens (Heraud et al., 2010; Kneipp et al., 2000; Martel et al., 2020). In the context of histological characterization with mFTIR specifically, clustering has been leveraged to distinguish the biochemical profiles of different degrees of pathology within a given sample, with performance comparable to that of a trained pathologist (Wehbe et al., 2015).
Carrying out this type of analysis across multiple samples is challenging with currently available software. Commercially available options, while rich in analysis functionality, can be computationally costly to run, often limiting processing to one sample at a time. Processing and analyzing samples individually, with long running times for each step, makes the workflow less systematic and more error-prone, potentially compromising reproducibility. Quasar, a recently available open-source spectroscopic data analysis toolbox extending the Orange suite, has overcome some of these challenges (Toplak et al., 2021). However, its interactive interface comes at the cost of the capacity for cluster computing, which is a necessity for timely analysis of large multi-sample datasets.
Multi-modal analysis approaches are gaining prominence in the life and medical sciences to aid biological discovery and provide insights for patient prognosis, diagnosis and therapy (Eddy et al., 2020; Miao et al., 2021; Palla et al., 2022). In this setting, an open-source tool for streamlined analysis of mFTIR data could unlock the significant potential of this method to better characterize the relatively understudied biochemical profile of tissues, and the data generated could then be integrated with other data modalities. This would substantially increase the utility of mFTIR beyond data partitioning and enable its use in disease characterization, since multi-modal and integrative approaches enable streamlined data exploration and validation linking specific cellular processes to key macromolecular features.
To address this need, we present Photizo, an open-source Python library that makes use of the SCANPY library (Wolf et al., 2018) and AnnData objects to enable spectral analysis while preserving spectra-level clinical data annotation. It includes pre-processing, analysis and visualization functions, including spatial mapping of spectra (Fig. 1).

Inputs and pre-processing
Data input into Photizo are read into a NumPy array for pre-processing. Following pre-processing of each sample, subsequent steps can be performed on individual samples or jointly across multiple samples. If using a data frame with multiple samples, we recommend creating an annotation data frame in pandas containing sample information (e.g. sample name, clinical data). This is necessary for visualization of clinical variables and of single-sample data.
Photizo pre-processing allows exclusion of outlier spectra with evidence of light scattering and spectra in regions with signal indicative of no sample (e.g. sample holes, regions outside of sample borders), enabling application of vector normalization to only spectra of interest. Positions of excluded spectra are saved for repopulation prior to spatial mapping. We recommend spatially verifying the position of excluded spectra to ensure consistency with histological features (e.g. holes).
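The exclusion and normalization steps described above can be illustrated with a minimal NumPy sketch. This is not Photizo's own API: the function names, the `min_signal` threshold and the max-absorbance criterion for detecting empty regions are assumptions made purely for illustration, and the data are synthetic. The boolean mask that is returned plays the role of the saved positions of excluded spectra.

```python
import numpy as np


def vector_normalize(spectra):
    """Scale each spectrum (one per row) to unit Euclidean norm."""
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    return spectra / norms


def exclude_and_normalize(spectra, min_signal):
    """Drop spectra whose maximum absorbance is below min_signal, then normalize.

    min_signal is a hypothetical threshold chosen for illustration only.
    Returns the normalized spectra and the boolean mask of retained rows.
    """
    keep = spectra.max(axis=1) >= min_signal
    return vector_normalize(spectra[keep]), keep


rng = np.random.default_rng(0)
spectra = rng.uniform(0.2, 1.0, size=(6, 50))  # synthetic absorbance values
spectra[[1, 4]] = 0.01                         # two pixels with no sample
normalized, keep = exclude_and_normalize(spectra, min_signal=0.1)
```

Normalizing only after exclusion avoids scaling near-zero spectra up to unit norm, which would otherwise amplify noise from empty regions.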
Pre-processing also enables the exclusion of the CO₂ region, which is useful when the CO₂ captured is of atmospheric origin and does not contribute to the analysis. Excluding this region prior to clustering ensures that atmospheric alterations do not create batch effects. Photizo can also calculate the second derivative of the spectra, which controls for baseline variation at the time of collection, thereby also minimizing batch effects in subsequent clustering.
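To illustrate why the second derivative controls for baseline variation, the sketch below applies a plain finite-difference second derivative (via `np.gradient`) to two synthetic spectra that differ only by a linear baseline, and then removes a wavenumber window around the atmospheric CO₂ band. The function names, the 2300 to 2400 cm⁻¹ window and the choice to differentiate before removing the region are assumptions for illustration; Photizo's own implementation may use a different derivative method and different bounds.

```python
import numpy as np


def second_derivative(wavenumbers, spectra):
    """Finite-difference second derivative of each spectrum (row)."""
    first = np.gradient(spectra, wavenumbers, axis=1)
    return np.gradient(first, wavenumbers, axis=1)


def drop_region(wavenumbers, spectra, low, high):
    """Remove all columns whose wavenumber lies within [low, high]."""
    keep = (wavenumbers < low) | (wavenumbers > high)
    return wavenumbers[keep], spectra[:, keep]


wn = np.arange(1000.0, 3001.0, 4.0)             # synthetic wavenumber axis
peak = np.exp(-((wn - 1650.0) / 20.0) ** 2)     # one synthetic absorption band
spectra = np.vstack([peak, peak + 0.001 * wn + 0.5])  # same band, tilted baseline

d2 = second_derivative(wn, spectra)
# Illustrative CO2 window; the exact bounds are an assumption.
wn_kept, d2_kept = drop_region(wn, d2, low=2300.0, high=2400.0)
```

Because the second derivative of any linear term is zero, both rows of `d2` coincide even though the raw spectra are offset and tilted relative to each other.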

Principal component analysis
Principal component analysis (PCA) can be used for dimensionality reduction, for identifying batch or spectral baseline effects prior to further analyses, and for discovering variables of genuine interest. Photizo has a PCA function optimized for spectral data that rapidly generates cumulative explained variance plots and a plot of the top eigen-spectra. The PCA outputs can also be used for principal component projection and custom plotting.
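The quantities mentioned above (cumulative explained variance, eigen-spectra and projections) can be obtained from a singular value decomposition of the mean-centered spectra. The following is a generic NumPy sketch on random data, not Photizo's PCA function; the name `pca_svd` and its return values are hypothetical.

```python
import numpy as np


def pca_svd(spectra, n_components):
    """PCA of spectra (rows = spectra, columns = wavenumbers) via SVD.

    Returns the scores (projections), the loadings (eigen-spectra, one
    per row) and the cumulative explained variance ratio.
    """
    centered = spectra - spectra.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u[:, :n_components] * s[:n_components]
    return scores, vt[:n_components], np.cumsum(explained)


rng = np.random.default_rng(1)
spectra = rng.normal(size=(40, 30))  # synthetic stand-in for real spectra
scores, loadings, cumulative = pca_svd(spectra, n_components=3)
```

Plotting `cumulative` against the component index gives a cumulative explained variance curve, and each row of `loadings` can be plotted against the wavenumber axis as an eigen-spectrum.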

Clustering
Photizo includes clustering tools which make use of uniform manifold approximation and projection (UMAP) for dimensionality reduction (Becht et al., 2019), paired with the Leiden algorithm (Traag et al., 2019) for community detection. Clustering may be performed with entire spectra or with a particular region of interest using the region selection functions.

Visualization and quantification
Cluster profiling benefits from functions for visual spectral inspection. Tools for quantitative comparisons also contribute to cluster characterization, with functions implemented for numerical integration of the area below the spectra within the wavenumber window of interest. Selection of the window of interest may be verified with a specific spectral inspection function, whereby the user can account for subtle peak shifts in the data to select integration windows consistent with the collected data. Resulting quantified values can be used for statistical comparisons and visualized using violin plots.
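As an illustration of the numerical integration step, the sketch below computes the area under each spectrum within a wavenumber window using the trapezoidal rule. The function name, the ascending wavenumber axis and the 1600 to 1700 cm⁻¹ window are assumptions for illustration, and the two synthetic spectra are chosen so that the exact areas are known.

```python
import numpy as np


def integrate_window(wavenumbers, spectra, low, high):
    """Trapezoidal area under each spectrum (row) for low <= wavenumber <= high.

    Assumes wavenumbers are sorted in ascending order.
    """
    in_window = (wavenumbers >= low) & (wavenumbers <= high)
    x = wavenumbers[in_window]
    y = spectra[:, in_window]
    return np.sum((y[:, 1:] + y[:, :-1]) / 2.0 * np.diff(x), axis=1)


wavenumbers = np.arange(1000.0, 1801.0, 2.0)
spectra = np.vstack([
    np.full(wavenumbers.size, 2.0),  # constant absorbance of 2
    wavenumbers,                     # absorbance increasing linearly
])
areas = integrate_window(wavenumbers, spectra, low=1600.0, high=1700.0)
```

The resulting per-spectrum values are the kind of quantity that could be grouped by cluster and compared statistically.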
Among the quantitative measures generated as outputs are estimates for secondary structure composition derived from the spectral features; these do not rely on spectral decomposition, but rather use statistically estimated content previously reported in the literature (Goormaghtigh et al., 2009), making this approach robust and reproducible.
Two key visualization functions in Photizo enable spatial mapping of data in the configuration of data collection, requiring only the number of spectra obtained in the x and y axes at time of collection. The first function maps integrated values for visualization of chemical content estimation across the tissue for a particular region of interest. The second enables spatial mapping of cluster classification. This feature is key for comparison with histological characterization and permits correlative analysis or integration (e.g. using machine learning-, topological- or tensor-based approaches) with other spatially resolved molecular profiling methods applied to adjacent tissue sections, such as spatial transcriptomics, imaging mass spectrometry or spatial proteomics.
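The idea of mapping per-spectrum values back onto the acquisition grid from only the number of spectra along x and y can be sketched as follows. The function name and the assumption that spectra were acquired row by row (row-major order) are hypothetical; positions of spectra excluded during pre-processing are refilled with NaN so that the grid keeps its original shape.

```python
import numpy as np


def to_grid(values, keep, n_x, n_y, fill=np.nan):
    """Place per-spectrum values on an (n_y, n_x) grid.

    keep marks which of the n_x * n_y acquired spectra were retained;
    excluded positions are filled with `fill`. Row-major acquisition
    order is assumed.
    """
    full = np.full(n_x * n_y, fill, dtype=float)
    full[keep] = values
    return full.reshape(n_y, n_x)


n_x, n_y = 4, 3
keep = np.ones(n_x * n_y, dtype=bool)
keep[5] = False                          # one spectrum was excluded
values = np.arange(11, dtype=float)      # e.g. integrated areas or cluster labels
grid = to_grid(values, keep, n_x, n_y)
```

The resulting array can be passed to any image-plotting routine, with the NaN entries marking holes or regions outside the sample.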

Example workflow and reference dataset
To facilitate the use of the library by new users, we have made available infrared imaging data acquired with a focal plane array detector, with spatially resolved spectra collected from brain sections for exploration of the library's functionality. The dataset includes areas from three neurodegenerative disease cases and three controls, so that users can run a full workflow with reference figures, data and metadata before applying the library to their own data.

Conclusions
Here, we present Photizo, an open-source library for analysis of FTIR spectroscopy data, which includes functionality for analyzing spatially resolved mFTIR data. This library is built in Python, a popular programming language with noted code readability, enabling users to analyze FTIR data with more flexibility regarding sample number and data size than currently available options, all at a low monetary cost. Photizo streamlines analysis of multiple samples, including the option of joint sample analysis, making its methods reproducible and easy to standardize across samples and datasets. Being built on Python, it can also be used in scripts submitted to cluster computing, vastly reducing computational costs for analysis. It has flexible functionality that facilitates reuse of basic functions, can be easily integrated into further workflows or analyses (e.g. statistical comparison of quantitative findings), and may also be adapted to the analysis of data from other vibrational spectroscopy methods. Importantly, while certain tools used in Photizo come from the biomedical sciences, the library is specimen-agnostic and can easily be used for spectral analysis of other sample types.
With the rise of integrative multi-modal analysis, this package contributes to closing the gap for mFTIR data to be analyzed as part of larger integrative studies, providing biochemical context for other omics technologies. Jointly, these features contribute to maximizing the utility of spectroscopy data, with lower costs, more options for automation and streamlined but flexible processing of large datasets.

Fig. 1. Example workflow of mFTIR data in Photizo. The Photizo workflow includes pre-processing, PCA, clustering and cluster quantification and visualization solutions for FTIR spectroscopy and imaging data. WN, wavenumber.